Take-home Exercise 1: Geospatial Analytics for Public Good

Published

November 23, 2023

Modified

November 30, 2023

1 Overview

1.1 Background

As urban infrastructures, including public transportation systems like buses, taxis, mass rapid transit, public utilities, and roads, become increasingly digitised, the data generated becomes a valuable resource for tracking the movements of people and vehicles over space and time. This transformation has been facilitated by pervasive computing technologies such as Global Positioning System (GPS) and Radio Frequency Identification (RFID) tags on vehicles. An example of this is the collection of data on bus routes and ridership, amassed from the use of smart cards and GPS devices available on public buses.

The data collected from these sources is likely to contain valuable patterns that offer insights into various aspects of human movement and behavior within a city. Analyzing and comparing these patterns can provide a deeper understanding of urban mobility. Such insights can be instrumental in improving urban management and can also serve as valuable information for both public and private sector stakeholders involved in urban transportation services. This information can aid them in making informed decisions and gaining a competitive edge in their respective domains.

1.2 Objectives

The objective of this study is to utilize appropriate Local Indicators of Spatial Association (LISA) and Emerging Hot Spot Analysis (EHSA) techniques to uncover spatial and spatio-temporal mobility patterns among public bus passengers in Singapore.

This will include the following tasks:

  1. Geovisualisation and Analysis:
    • Compute the passenger trips generated by origin at the hexagon level, for each of the peak hour periods below
    • Display the geographical distribution of the passenger trips
    • Explore spatial patterns revealed by the geovisualisation

    Peak hour period               Bus tap on time
    Weekday morning peak           6am to 9am
    Weekday afternoon peak         5pm to 8pm
    Weekend/holiday morning peak   11am to 2pm
    Weekend/holiday evening peak   4pm to 7pm

  2. Local Indicators of Spatial Association (LISA) Analysis:
    • Compute LISA of the passenger trips generated by origin
    • Display the LISA maps and draw statistical conclusions from them
  3. Emerging Hot Spot Analysis (EHSA):
    • Perform the Mann-Kendall Test using the spatio-temporal local Gi* values
    • Display EHSA maps of the Gi* values and describe the spatial patterns revealed

2 Loading Packages

The following packages will be used for this exercise:

  • sf for importing, managing, and handling geospatial data,
  • tidyverse for non-spatial data handling,
  • sfdep will be used to compute spatial weights, global and local spatial autocorrelation statistics, and
  • tmap for thematic mapping.

The code chunk below uses the p_load() function of the pacman package to ensure that the sfdep, sf, tmap and tidyverse packages (together with the supporting packages knitr, patchwork, DT, mapview, leaflet and scales) are installed and loaded into R.

pacman::p_load(sf, sfdep, tmap, tidyverse, knitr,
               patchwork, DT, mapview, leaflet, scales)

# -   Creates a package list containing the necessary R packages
# -   Checks if the R packages in the package list have been installed
# -   If not installed, will install the missing packages & launch into R environment.

3 Data Preparation

3.1 The Data

The following data are used for this study:

  • Aspatial:
    • Passenger Volume by Origin Destination Bus Stops for August, September and October 2023, downloaded from LTA DataMall using API.
  • Geospatial
    • Bus Stop Location from LTA DataMall. It provides information about all the bus stops currently being serviced by buses, including the bus stop code (identifier) and location coordinates.
    • Hexagon, a hexagonal grid layer of 250m (apothem), created in this exercise to replace the relatively coarse and irregular Master Plan 2019 Planning Sub-zone GIS data set of URA.

3.2 Import & Preparation

3.2.1 Aspatial

::: panel-tabset

Import into R

We will be importing the Passenger Volume by Origin Destination Bus Stops datasets for August to October 2023, downloaded from LTA DataMall, using read_csv() of the readr package.

odbus08 <- read_csv("data/aspatial/origin_destination_bus_202308.csv")
#odbus09 <- read_csv("data/aspatial/origin_destination_bus_202309.csv")
#odbus10 <- read_csv("data/aspatial/origin_destination_bus_202310.csv")

Data Exploration

(a) Attributes

glimpse() of the dplyr package allows us to see all columns and their data type in the data frame.

glimpse(odbus08)
Rows: 5,709,512
Columns: 7
$ YEAR_MONTH          <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE            <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR       <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE             <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE      <chr> "04168", "04168", "80119", "80119", "44069", "4406…
$ DESTINATION_PT_CODE <chr> "10051", "10051", "90079", "90079", "17229", "1722…
$ TOTAL_TRIPS         <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
#glimpse(odbus09)
#glimpse(odbus10)

Insights:

  • There are 7 variables in the odbus08 tibble data frame:
    • YEAR_MONTH: Month in which data is collected
    • DAY_TYPE: Weekdays or weekends/holidays
    • TIME_PER_HOUR: Hour which the passenger trip is based on, in intervals from 0 to 23 hours
    • PT_TYPE: Type of public transport, i.e. bus
    • ORIGIN_PT_CODE: Origin bus stop ID
    • DESTINATION_PT_CODE: Destination bus stop ID
    • TOTAL_TRIPS: Number of trips
  • We also note that ORIGIN_PT_CODE and DESTINATION_PT_CODE are imported as character data type.

(b) Unique Bus Stops

n_distinct() of the dplyr package allows us to count the unique bus stops in the data set.

n_distinct(odbus08$ORIGIN_PT_CODE)
[1] 5067

The results reveal that there are 5067 distinct origin bus stops.

Data Wrangling

(a) Convert Data Type

as.factor() can be used to convert the variables ORIGIN_PT_CODE and DESTINATION_PT_CODE from character to factor (categorical) data type. We use glimpse() again to check the results.

odbus08$ORIGIN_PT_CODE <- as.factor(odbus08$ORIGIN_PT_CODE)
odbus08$DESTINATION_PT_CODE <- as.factor(odbus08$DESTINATION_PT_CODE)

glimpse(odbus08)
Rows: 5,709,512
Columns: 7
$ YEAR_MONTH          <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE            <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR       <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE             <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE      <fct> 04168, 04168, 80119, 80119, 44069, 44069, 20281, 2…
$ DESTINATION_PT_CODE <fct> 10051, 10051, 90079, 90079, 17229, 17229, 20141, 2…
$ TOTAL_TRIPS         <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …

Note that both of them are in factor data type now.
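For data frames with several such columns, the same conversion can be written once with mutate() and across(). Below is a minimal sketch on a toy tibble (the toy data is illustrative only; the same mutate() line applies unchanged to odbus08):

```r
library(dplyr)

# Toy tibble standing in for odbus08 (illustrative values only)
toy <- tibble(ORIGIN_PT_CODE      = c("04168", "80119"),
              DESTINATION_PT_CODE = c("10051", "90079"))

# Convert both code columns to factor in one step
toy <- toy %>%
  mutate(across(c(ORIGIN_PT_CODE, DESTINATION_PT_CODE), as.factor))

sapply(toy, class)  # both columns are now "factor"
```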

(b) Duplicates Check

Before moving on to the next step, it is a good practice for us to check for duplicated records to prevent double counting of passenger trips.

duplicate <- odbus08 %>% 
  group_by_all() %>% 
  filter(n()>1) %>% 
  ungroup()
  
duplicate
# A tibble: 0 × 7
# ℹ 7 variables: YEAR_MONTH <chr>, DAY_TYPE <chr>, TIME_PER_HOUR <dbl>,
#   PT_TYPE <chr>, ORIGIN_PT_CODE <fct>, DESTINATION_PT_CODE <fct>,
#   TOTAL_TRIPS <dbl>

Results confirm that there are no duplicated records found.

(c) Extracting the Study Data

In our study, we would like to know the patterns for the 4 peak hour periods defined earlier. The code chunk below therefore creates a new variable, period, using nested ifelse() calls to label the peak period (if any) in which each observation occurred.

peak <- odbus08 %>%
  # Weekday morning peak
  mutate(period= ifelse(DAY_TYPE=="WEEKDAY" & (TIME_PER_HOUR >= 6 & TIME_PER_HOUR <= 9), "WDM", 
                        # Weekday afternoon peak
                        ifelse(DAY_TYPE=="WEEKDAY" & (TIME_PER_HOUR >= 17 & TIME_PER_HOUR <= 20), "WDA", 
                               # Weekend/holiday morning peak
                               ifelse(DAY_TYPE=="WEEKENDS/HOLIDAY" & (TIME_PER_HOUR >= 11 & TIME_PER_HOUR <= 14), "WEM",
                                      # Weekend/holiday evening peak
                                      ifelse(DAY_TYPE=="WEEKENDS/HOLIDAY" & (TIME_PER_HOUR >= 16 & TIME_PER_HOUR <= 19), "WEE",
                                             # Return off-peak if neither of the peak hour periods
                                             "Off-peak")))))
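The same classification can also be expressed with dplyr's case_when(), which evaluates conditions top to bottom and avoids deep ifelse() nesting. A minimal sketch on a toy tibble (the toy data is illustrative; the same mutate() applies unchanged to odbus08):

```r
library(dplyr)

toy <- tibble(
  DAY_TYPE      = c("WEEKDAY", "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY"),
  TIME_PER_HOUR = c(7, 18, 12, 23)
)

toy <- toy %>%
  mutate(period = case_when(
    DAY_TYPE == "WEEKDAY"          & between(TIME_PER_HOUR,  6,  9) ~ "WDM",
    DAY_TYPE == "WEEKDAY"          & between(TIME_PER_HOUR, 17, 20) ~ "WDA",
    DAY_TYPE == "WEEKENDS/HOLIDAY" & between(TIME_PER_HOUR, 11, 14) ~ "WEM",
    DAY_TYPE == "WEEKENDS/HOLIDAY" & between(TIME_PER_HOUR, 16, 19) ~ "WEE",
    TRUE                                                            ~ "Off-peak"
  ))

toy$period  # "WDM" "WDA" "WEM" "Off-peak"
```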

We can then filter for peak-period data using the newly created period column and aggregate the total trips for each origin bus stop during peak period.

peakperiods <- peak %>% 
  # filter helps to keep records that occurred during peak periods
  filter(period !="Off-peak") %>% 
  # aggregate the total passenger trips for each origin bus stop
  group_by(period, ORIGIN_PT_CODE) %>% 
  summarise(TRIPS=sum(TOTAL_TRIPS))

Let’s visualise the proportions of passenger volumes for each peak period.

Show the code
freq<- ggplot(data=peakperiods, 
       aes(x=period,y=TRIPS))+
  geom_bar(stat="identity") +
  theme(legend.position="none")+
  labs(title = "Frequency of Trip for each Peak Period",
      x = "Peak Period",
      y = "Frequency")

freq + scale_y_continuous(labels=label_comma())

We can see that passenger volumes on weekdays are much higher than on weekends/holidays.

Transposing each peak period into a column using pivot_wider() of the tidyr package allows us to create further variables at the bus stop level. We replace NA values with 0 to reflect periods in which a bus stop had no traffic.

peakperiods_wide <- pivot_wider(peakperiods, 
                                names_from = "period", 
                                values_from = "TRIPS")

peakperiods_wide["WDA"][is.na(peakperiods_wide["WDA"])] <- 0
peakperiods_wide["WDM"][is.na(peakperiods_wide["WDM"])] <- 0
peakperiods_wide["WEE"][is.na(peakperiods_wide["WEE"])] <- 0
peakperiods_wide["WEM"][is.na(peakperiods_wide["WEM"])] <- 0

glimpse(peakperiods_wide)
Rows: 5,067
Columns: 5
$ ORIGIN_PT_CODE <fct> 01012, 01013, 01019, 01029, 01039, 01059, 01109, 01112,…
$ WDA            <dbl> 8448, 7328, 3608, 9317, 12937, 2133, 322, 45010, 27233,…
$ WDM            <dbl> 1973, 952, 1789, 2561, 2938, 1651, 161, 8492, 9015, 424…
$ WEE            <dbl> 3208, 2796, 1623, 4244, 7403, 1190, 88, 21706, 11927, 6…
$ WEM            <dbl> 2273, 1697, 1511, 3272, 5424, 1062, 89, 14964, 8278, 61…

Notice that there are 5067 unique origin bus stops.
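As an aside, the pivot and the four NA-replacement lines above can be collapsed into a single call, since pivot_wider() accepts a values_fill argument (available since tidyr 1.0). A sketch on a toy tibble (toy data for illustration only):

```r
library(tidyr)

toy <- tibble::tribble(
  ~period, ~ORIGIN_PT_CODE, ~TRIPS,
  "WDM",   "01012",         100,
  "WDA",   "01013",          50
)

# values_fill = 0 fills period/bus-stop combinations with no trips
toy_wide <- pivot_wider(toy,
                        names_from  = "period",
                        values_from = "TRIPS",
                        values_fill = 0)
toy_wide
# ORIGIN_PT_CODE   WDM   WDA
# 01012            100     0
# 01013              0    50
```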

(d) Variable Transformation

Show the code
# Extract column
distWDM <- peakperiods_wide$WDM
# Calculate mean 
distWDM_mean <- mean(distWDM)

plot_distWDM <- ggplot(
    data = data.frame(distWDM),
    aes(x = distWDM)
  ) +
  geom_histogram(
    bins = 20, 
    color = "#FFFCF9", 
    fill = "black",
    alpha = .3
  ) +
  # Add line for mean
  geom_vline(
    xintercept = distWDM_mean, 
    color = "#595DE5", 
    linetype = "dashed", 
    linewidth = 1
  ) +
  # Add line annotations
  annotate(
    "text", 
    x = 80000, 
    y = 2000,
    label = paste("Mean =", round(distWDM_mean, 3)),
    color = "#595DE5",
    size = 3
  ) +
  labs(
    title = "Weekday Morning Peak",
    x = "Bus Trips",
    y = "Frequency"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust=1)
  ) +
  scale_x_continuous(labels = label_number(), n.breaks=8)

# Extract column
distWDA <- peakperiods_wide$WDA
# Calculate mean 
distWDA_mean <- mean(distWDA)

plot_distWDA <- ggplot(
    data = data.frame(distWDA),
    aes(x = distWDA)
  ) +
  geom_histogram(
    bins = 20, 
    color = "#FFFCF9", 
    fill = "black",
    alpha = .3
  ) +
  # Add line for mean
  geom_vline(
    xintercept = distWDA_mean, 
    color = "#595DE5", 
    linetype = "dashed", 
    linewidth = 1
  ) +
  # Add line annotations
  annotate(
    "text", 
    x = 110000, 
    y = 2000,
    label = paste("Mean =", round(distWDA_mean, 3)),
    color = "#595DE5",
    size = 3
  ) +
  labs(
    title = "Weekday Afternoon Peak",
    x = "Bus Trips",
    y = "Frequency"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust=1)
  ) +
  scale_x_continuous(labels = label_number(), n.breaks=8)

# Extract column
distWEM <- peakperiods_wide$WEM
# Calculate mean 
distWEM_mean <- mean(distWEM)

plot_distWEM <- ggplot(
    data = data.frame(distWEM),
    aes(x = distWEM)
  ) +
  geom_histogram(
    bins = 20, 
    color = "#FFFCF9", 
    fill = "black",
    alpha = .3
  ) +
  # Add line for mean
  geom_vline(
    xintercept = distWEM_mean, 
    color = "#595DE5", 
    linetype = "dashed", 
    linewidth = 1
  ) +
  # Add line annotations
  annotate(
    "text", 
    x = 23000, 
    y = 2000,
    label = paste("Mean =", round(distWEM_mean, 3)),
    color = "#595DE5",
    size = 3
  ) +
  labs(
    title = "Weekend/Holiday Morning Peak",
    x = "Bus Trips",
    y = "Frequency"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust=1)
  ) +
  scale_x_continuous(labels = label_number(), n.breaks=8)

# Extract column
distWEE <- peakperiods_wide$WEE
# Calculate mean 
distWEE_mean <- mean(distWEE)

plot_distWEE <- ggplot(
    data = data.frame(distWEE),
    aes(x = distWEE)
  ) +
  geom_histogram(
    bins = 20, 
    color = "#FFFCF9", 
    fill = "black",
    alpha = .3
  ) +
  # Add line for mean
  geom_vline(
    xintercept = distWEE_mean, 
    color = "#595DE5", 
    linetype = "dashed", 
    linewidth = 1
  ) +
  # Add line annotations
  annotate(
    "text", 
    x = 29000, 
    y = 2000, 
    label = paste("Mean =", round(distWEE_mean, 3)),
    color = "#595DE5",
    size = 3
  ) +
  labs(
    title = "Weekend/Holiday Evening Peak",
    x = "Bus Trips",
    y = "Frequency"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust=1)
  ) +
  scale_x_continuous(labels = label_number(), n.breaks=8)

(plot_distWDM | plot_distWDA)/
(plot_distWEM | plot_distWEE)

The distributions of passenger trips for the 4 peak periods appear to be highly right-skewed. Rescaling the data using a log transformation can greatly reduce the skewness.

peakperiods_wider <- peakperiods_wide %>% 
  mutate(logWDM = ifelse(WDM == 0, 0, log(WDM)),
         logWDA = ifelse(WDA == 0, 0, log(WDA)),
         logWEM = ifelse(WEM == 0, 0, log(WEM)),
         logWEE = ifelse(WEE == 0, 0, log(WEE)))
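If a log(1 + x) scale is acceptable, base R's log1p() is defined at zero and removes the need for the ifelse() guards above. A hedged alternative, sketched on a toy tibble (the same mutate() would apply to peakperiods_wide, but it is not used in the rest of this exercise):

```r
library(dplyr)

toy <- tibble(ORIGIN_PT_CODE = c("01012", "01013"),
              WDM = c(0, 952), WDA = c(8448, 0))

# log1p(x) = log(1 + x) handles zero trips without guards.
# .names = "log{.col}" produces logWDM, logWDA, etc.
toy <- toy %>%
  mutate(across(c(WDM, WDA), log1p, .names = "log{.col}"))

toy$logWDM  # 0 and log(953), i.e. approximately 6.86
```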

Let’s visualise the distribution of the 4 peak periods again.

Show the code
# Extract column
distlogWDM <- peakperiods_wider$logWDM
# Calculate mean 
distlogWDM_mean <- mean(distlogWDM)

plot_distlogWDM <- ggplot(
    data = data.frame(distlogWDM),
    aes(x = distlogWDM)
  ) +
  geom_histogram(
    bins = 20, 
    color = "#FFFCF9", 
    fill = "#34414E",
    alpha = .3
  ) +
  # Add line for mean
  geom_vline(
    xintercept = distlogWDM_mean, 
    color = "#595DE5", 
    linetype = "dashed", 
    linewidth = 1
  ) +
  # Add line annotations
  annotate(
    "text", 
    x = 10, 
    y = 1000,
    label = paste("Mean =", round(distlogWDM_mean, 3)),
    color = "#595DE5",
    size = 3
  ) +
  labs(
    title = "Weekday Morning Peak",
    x = "Bus Trips",
    y = "Frequency"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust=1)
  ) 

# Extract column
distlogWDA <- peakperiods_wider$logWDA
# Calculate mean 
distlogWDA_mean <- mean(distlogWDA)

plot_distlogWDA <- ggplot(
    data = data.frame(distlogWDA),
    aes(x = distlogWDA)
  ) +
  geom_histogram(
    bins = 20, 
    color = "#FFFCF9", 
    fill = "#34414E",
    alpha = .3
  ) +
  # Add line for mean
  geom_vline(
    xintercept = distlogWDA_mean, 
    color = "#595DE5", 
    linetype = "dashed", 
    linewidth = 1
  ) +
  # Add line annotations
  annotate(
    "text", 
    x = 10, 
    y = 1000,
    label = paste("Mean =", round(distlogWDA_mean, 3)),
    color = "#595DE5",
    size = 3
  ) +
  labs(
    title = "Weekday Afternoon Peak",
    x = "Bus Trips",
    y = "Frequency"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust=1)
  )

# Extract column
distlogWEM <- peakperiods_wider$logWEM
# Calculate mean 
distlogWEM_mean <- mean(distlogWEM)

plot_distlogWEM <- ggplot(
    data = data.frame(distlogWEM),
    aes(x = distlogWEM)
  ) +
  geom_histogram(
    bins = 20, 
    color = "#FFFCF9", 
    fill = "#34414E",
    alpha = .3
  ) +
  # Add line for mean
  geom_vline(
    xintercept = distlogWEM_mean, 
    color = "#595DE5", 
    linetype = "dashed", 
    linewidth = 1
  ) +
  # Add line annotations
  annotate(
    "text", 
    x = 10, 
    y = 1000,
    label = paste("Mean =", round(distlogWEM_mean, 3)),
    color = "#595DE5",
    size = 3
  ) +
  labs(
    title = "Weekend/Holiday Morning Peak",
    x = "Bus Trips",
    y = "Frequency"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust=1)
  ) 

# Extract column
distlogWEE <- peakperiods_wider$logWEE
# Calculate mean 
distlogWEE_mean <- mean(distlogWEE)

plot_distlogWEE <- ggplot(
    data = data.frame(distlogWEE),
    aes(x = distlogWEE)
  ) +
  geom_histogram(
    bins = 20, 
    color = "#FFFCF9", 
    fill = "#34414E",
    alpha = .3
  ) +
  # Add line for mean
  geom_vline(
    xintercept = distlogWEE_mean, 
    color = "#595DE5", 
    linetype = "dashed", 
    linewidth = 1
  ) +
  # Add line annotations
  annotate(
    "text", 
    x = 10, 
    y = 1000, 
    label = paste("Mean =", round(distlogWEE_mean, 3)),
    color = "#595DE5",
    size = 3
  ) +
  labs(
    title = "Weekend/Holiday Evening Peak",
    x = "Bus Trips",
    y = "Frequency"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust=1)
  ) 

(plot_distlogWDM | plot_distlogWDA)/
(plot_distlogWEM | plot_distlogWEE)

3.2.2 Geospatial

(a) Bus Stop Shapefile

In this section, we import the BusStop shapefile into R using the st_read() function of the sf package. This data provides the locations of all bus stops as at Q2 2023. st_transform(crs = 3414) reprojects the data to EPSG:3414, the code for the SVY21 projected coordinate system used in Singapore.

busstop <- st_read(dsn = "data/geospatial",
                   layer = "BusStop") %>% 
  st_transform(crs = 3414)
Reading layer `BusStop' from data source 
  `C:\kytjy\ISSS624\Take-Home_Ex\Take-Home_Ex1\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21

The imported shapefile is a simple feature object of sf. From the output, we can see that there are 5161 points with 3 fields, and confirm that the projected CRS is SVY21.

Recall that there are 5067 origin bus stops in the peakperiods_wider table, compared to the 5161 bus stops in LTA's BusStop shapefile. This could be due to a timing difference: LTA's BusStop shapefile is as of July 2023, while peakperiods is based on August 2023 ridership.
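One quick way to quantify such a mismatch is base R's setdiff() on the bus stop codes. The sketch below demonstrates the idea on toy vectors; against the real data the comparison would be between busstop$BUS_STOP_N and the unique ORIGIN_PT_CODE values of odbus08 (column names as shown in the outputs above):

```r
# Toy code vectors (illustrative only)
shapefile_codes <- c("01012", "01013", "01019", "99999")
ridership_codes <- c("01012", "01013", "01019")

# Stops present in the shapefile but absent from the ridership data
unserved <- setdiff(shapefile_codes, ridership_codes)
unserved  # "99999"

# Against the real data (not run here):
# setdiff(busstop$BUS_STOP_N, as.character(unique(odbus08$ORIGIN_PT_CODE)))
```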

mapview::mapview(busstop)

Note that there are 5 bus stops located outside Singapore: bus stops 46239, 46609, 47701, 46211, and 46219.

(b) Hexagon Layer

A hexagonal grid is used to replace the relatively coarse and irregular Master Plan 2019 Planning Sub-zone GIS data set of URA. Hexagons have a number of advantages over squares and irregular polygons:

  • The distance between the centroid of a hexagon to all neighboring centroids is the same in all directions.
  • The lack of acute angles in a regular hexagon means that no areas of the shape are outliers in any direction.
  • All neighboring hexagons have the same spatial relationship with the central hexagon, making spatial querying and joining a more straightforward process.
  • Unlike square-based grids, the geometry of hexagons is well-suited to representing the curves of geographic features, which are rarely perpendicular in shape, such as rivers and roads.
  • The “softer” shape of a hexagon compared to a square means it performs better at representing gradual spatial changes.

We first create a hexagonal grid layer of 250m (the perpendicular distance between the centre of a hexagon and its edges, i.e. the apothem) with st_make_grid(), then use st_sf() to convert the grid into an sf object, with the code below.

st_make_grid function is used to create a grid over a spatial object. It takes 4 arguments, they are:

  • x: sf object; the input spatial data

  • cellsize: for hexagonal cells, the distance between opposite edges, in the units of the CRS of the spatial data. In this case, we take cellsize to be 250m * 2 = 500m

  • what: character; one of: "polygons", "corners", or "centers"
  • square: logical; indicates whether to create a square grid (TRUE) or a hexagonal grid (FALSE)

area_hexagon_grid = st_make_grid(busstop, 500, what = "polygons", square = FALSE)
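As a sanity check of the cellsize convention (an illustrative side sketch on a toy 1 km bounding box, not part of the analysis): with square = FALSE, cellsize = 500 should yield regular hexagons whose centre is 250m from each edge, i.e. an area of 2 * sqrt(3) * 250^2, roughly 216,506 square metres.

```r
library(sf)

# Toy 1 km x 1 km box; no CRS is needed for this check
box <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 1000, ymax = 1000)))

hex <- st_make_grid(box, cellsize = 500, what = "polygons", square = FALSE)

# Area of a regular hexagon with apothem a is 2 * sqrt(3) * a^2
as.numeric(st_area(hex[1]))  # close to the value below
2 * sqrt(3) * 250^2          # 216506.4
```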

Next, st_sf() converts the grid to an sf object, while mutate() assigns a unique running grid_id to each hexagon.

# Converts grid to sf
hexagon_grid_sf = st_sf(area_hexagon_grid) %>%
  # Assign unique ID to each grid
  mutate(grid_id = 1:length(lengths(area_hexagon_grid)))

We count the number of bus stops in each grid and keep grids with bus stops using the code chunks below.

# Create a column containing the count of bus stops in each grid
hexagon_grid_sf$busstops = lengths(st_intersects(hexagon_grid_sf, busstop))

# Remove grids with no bus stop inside, i.e. only keep hexagons with bus stops
hexagon_w_busstops = filter(hexagon_grid_sf, busstops > 0)

Let’s confirm that all bus stops have been accounted for in our hexagon layer.

sum(hexagon_w_busstops$busstops)
[1] 5161

This is in line with the 5161 points of the busstop shapefile.

Lastly, using tm_shape of tmap, we can quickly visualise the results of the hexagon grids we have created.

Show the code
tmap_mode ("view")
hex <- tm_shape(hexagon_w_busstops)+
  tm_fill(
    col = "busstops",
    palette = "Blues",
    style = "cont",
    title = "Number of Bus Stops",
    id = "grid_id",
    showNA = FALSE,
    alpha = 0.6,
    popup.format = list(
      grid_id = list(format = "f", digits = 0)
    )
  )+
  tm_borders(col = "grey40", lwd = 0.7)
hex

(c) Combining Bus Stop and Hexagon Layers

The code chunk below populates the grid ID (i.e. grid_id) of the hexagon_w_busstops sf data frame into the busstop sf data frame.

bs_wgrids <- st_intersection(busstop, hexagon_w_busstops) %>% 
  select(BUS_STOP_N,BUS_ROOF_N,LOC_DESC, grid_id, busstops) %>% 
  st_drop_geometry
Note
  • st_intersection() is used to perform a point-and-polygon overlay, and the output is a point sf object.
  • select() of the dplyr package is then used to retain only BUS_STOP_N, BUS_ROOF_N, LOC_DESC, grid_id and busstops.
  • st_drop_geometry() removes the geometry column so that the result can be manipulated like a regular data frame using tidyr and dplyr functions.

Before we proceed, let’s perform a duplicates check on bs_wgrids.

duplicate2 <- bs_wgrids %>%
  group_by_all() %>%
  filter(n()>1) %>%
  ungroup()

duplicate2
# A tibble: 8 × 5
  BUS_STOP_N BUS_ROOF_N LOC_DESC             grid_id busstops
  <chr>      <chr>      <chr>                  <int>    <int>
1 43709      B06        BLK 644                 1904        7
2 43709      B06        BLK 644                 1904        7
3 58031      UNK        OPP CANBERRA DR         2939        7
4 58031      UNK        OPP CANBERRA DR         2939        7
5 51071      B21        MACRITCHIE RESERVOIR    3081        6
6 51071      B21        MACRITCHIE RESERVOIR    3081        6
7 97079      B14        OPP ST. JOHN'S CRES     5037        5
8 97079      B14        OPP ST. JOHN'S CRES     5037        5

The results display 4 bus stops with genuinely duplicated records (8 rows in total). We remove the duplicates to prevent double counting.

The code chunk below helps retain unique records.

bs_wgrids <- unique(bs_wgrids)

(d) Populate PeakPeriods with Grid Details

We can now append the grid details from the bs_wgrids data frame onto the peakperiods_wider data frame. Recall that we previously identified 5 bus stops located outside Singapore; filter() allows us to exclude them.

origin_grid <- left_join(peakperiods_wider, bs_wgrids,
            by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>% 
  rename(ORIGIN_BS = ORIGIN_PT_CODE) %>% 
  group_by(grid_id) %>% 
  # retains SG bus stops
  filter(!ORIGIN_BS %in% c(46239, 46609, 47701, 46211, 46219))

glimpse(origin_grid)
Rows: 5,076
Columns: 13
Groups: grid_id [1,504]
$ ORIGIN_BS  <chr> "01012", "01013", "01019", "01029", "01039", "01059", "0110…
$ WDA        <dbl> 8448, 7328, 3608, 9317, 12937, 2133, 322, 45010, 27233, 932…
$ WDM        <dbl> 1973, 952, 1789, 2561, 2938, 1651, 161, 8492, 9015, 4240, 5…
$ WEE        <dbl> 3208, 2796, 1623, 4244, 7403, 1190, 88, 21706, 11927, 6221,…
$ WEM        <dbl> 2273, 1697, 1511, 3272, 5424, 1062, 89, 14964, 8278, 6198, …
$ logWDM     <dbl> 7.587311, 6.858565, 7.489412, 7.848153, 7.985484, 7.409136,…
$ logWDA     <dbl> 9.041685, 8.899458, 8.190909, 9.139596, 9.467847, 7.665285,…
$ logWEM     <dbl> 7.728856, 7.436617, 7.320527, 8.093157, 8.598589, 6.967909,…
$ logWEE     <dbl> 8.073403, 7.935945, 7.392032, 8.353261, 8.909641, 7.081709,…
$ BUS_ROOF_N <chr> "B03", "B05", "B04", "B07", "B09", "B08", "TMNL", "B07", "B…
$ LOC_DESC   <chr> "HOTEL GRAND PACIFIC", "ST JOSEPH'S CH", "BRAS BASAH CPLX",…
$ grid_id    <int> 3292, 3292, 3292, 3323, 3354, 3324, 3324, 3292, 3324, 3292,…
$ busstops   <int> 8, 8, 8, 7, 8, 7, 7, 8, 7, 8, 7, 7, 8, 7, 7, 7, 7, 7, 7, 7,…

(e) Retrieve Geometry

origin_gridwgeom <- inner_join(hexagon_w_busstops,
                               origin_grid, 
                           by = "grid_id")
#origin_gridwgeom <- st_as_sf(origin_gridwgeom)
Important

In order to retain the geospatial properties, the left data frame must be the sf data frame (i.e. hexagon_w_busstops).

4 Geovisualisation & Analysis

4.1 Data Classification

Different classification schemes highlight areas with the highest and/or lowest values, while others create classes that result in a more uniform distribution of colors. When data is sharply skewed or has extreme outliers, it’s important to consider whether the goal is to emphasize those areas or to achieve a more even distribution of colors and sizes.

The main methods of data classification are:

  • Quantile: each class contains an equal number of features, so there are no empty classes or classes with too few or too many values.
  • Jenks/Natural breaks: seeks clumps of values that are clustered together in order to form categories that may reflect meaningful groupings of areas.
  • Equal: divides the range of attribute values into equal-sized sub-ranges.

Since our variable is less skewed after log transformation, we can explore various classification methods for visualization. This approach may reveal interesting insights that were not immediately apparent before.
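The three schemes can also be inspected numerically with classIntervals() of the classInt package, which tmap relies on for its style argument. classInt is not loaded above, so this is a side sketch on simulated right-skewed data rather than on our trip counts:

```r
library(classInt)

set.seed(42)
x <- rlnorm(500, meanlog = 8, sdlog = 1)  # right-skewed, like raw trip counts

# For n = 5 classes, each call returns 6 break points
classIntervals(x, n = 5, style = "quantile")$brks  # equal counts per class
classIntervals(x, n = 5, style = "equal")$brks     # equal-width sub-ranges
classIntervals(x, n = 5, style = "jenks")$brks     # natural breaks
```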

4.2 Plots

4.2.1 Weekday Morning Peak Period

Show the code
tmap_mode ("view")

plotlogWDM_q <- tm_basemap("CartoDB.Positron") +
  tm_shape(origin_gridwgeom) +
  tm_fill("logWDM", 
          style = "quantile", 
          palette = "Blues",
          title = "Total passenger trips",
          alpha=0.6,
          id="LOC_DESC")

plotlogWDM_q

The grids above are partitioned using quantile intervals. We can observe that the bus trips are unevenly distributed across Singapore. There are lighter shades of blue (indicating lower levels of ridership) originating from the edges of the country, particularly in the West, while higher levels of ridership in the North region are indicated by the darker shades of blue.

Bus stops nearer to the residential estates appeared to be popular during the weekday morning peak period:

  • West: BLK 821, BLK 252, Sunshine Place
  • North: BLK 314
  • North-East: BLK 477A, BLK 1, BLK 555, BLK 324
  • East: BLK 109, BLK 124, BLK 756

This is likely due to a large number of people commuting from home to their workplaces/schools on weekday mornings.

Higher passenger traffic was noted at the bus stops nearer to MRT stations such as Harbourfront Station, Farrer Road Station, Yio Chu Kang Station, and Admiralty Station. A possible contributing factor is commuters transferring from the MRT to buses to reach their destinations.

Lastly, Woodlands Checkpoint also demonstrated higher ridership. This could potentially be due to people commuting across the Causeway from Malaysia into Singapore.

Show the code
tmap_mode ("view")
plotlogWDM_e <- tm_basemap("CartoDB.Positron") +
  tm_shape(origin_gridwgeom) +
  tm_fill("logWDM", 
          style = "equal", 
          palette = "Blues",
          title = "Total passenger trips",
          alpha=0.6,
          id="LOC_DESC")
  
plotlogWDM_e

The map using equal intervals provided slightly different insights. We noted that the bus stops located near MRT stations had higher levels of ridership. In particular, more trips originated from Tiong Bahru Station, Buona Vista Station, Tanah Merah Station, Admiralty Station, Harbourfront, and Woodleigh Station. Bus interchanges also appeared to be popular origins, i.e. Bukit Panjang Interchange and Joo Koon Interchange.

In general, more homogeneity is noted using the equal interval classification: the contrast from hexagon to hexagon is less obvious.

Show the code
tmap_mode ("view")
plotlogWDM_j <- tm_basemap("CartoDB.Positron") +
  tm_shape(origin_gridwgeom) +
  tm_fill("logWDM", 
          style = "jenks", 
          palette = "Blues",
          title = "Total passenger trips",
          alpha=0.6,
          id="LOC_DESC")
  
plotlogWDM_j

Using Jenks' partitioning method, the results were largely similar to the other two types of interval classes. Higher bus ridership was spotted at bus stops within close proximity to MRT stations (Kranji Station, Buona Vista Station, Buangkok Station, Ranggung Station, Farrer Road Station, Stevens Station, Bedok Reservoir Station) and residential estates (Sunshine Place near Tengah, BLK 109 in Bedok, BLK 477A in Sengkang, Bef. BLK 629A in Woodlands, to name a few).

4.2.2 Weekday Afternoon Peak Period

Show the code
tmap_mode ("view")
plotlogWDA_q <- tm_basemap("CartoDB.Positron") +
  tm_shape(origin_gridwgeom) +
  tm_fill("logWDA", 
          style = "quantile", 
          palette = "Reds",
          title = "Total passenger trips",
          alpha=0.6,
          id="LOC_DESC")
  
plotlogWDA_q

A look at the weekday afternoon ridership using the quantile classification yielded the following insights.

  • Ridership from Woodlands Checkpoint remained high.
  • Ridership at bus stops in close proximity to MRT stations and at popular bus stops in residential estates remained high.
  • More trips originating from institutional areas: Opposite Ngee Ann Poly, Temasek Poly, NIE BLK 2, School of the Arts
  • More trips originating from industrial buildings/business parks: North Link Bldg, Aft Senoko Way, Mapletree Business City, Woodlands Auto Hub, Opp Airline Hse, etc.
  • More trips originating from hospitals: Yishun Community Hospital, Changi General Hospital
  • Seletar Camp also appeared to have high passenger volumes
  • The far West seemed to experience low ridership other than the bus stop opposite Tuas Link Station.
  • Southern part of Singapore, consisting of more commercial areas, appeared to be more clustered, as illustrated by the density of the darker red hexagons, compared to the weekday morning peak period.

Show the code
tmap_mode ("view")
plotlogWDA_e <- tm_basemap("CartoDB.Positron") +
  tm_shape(origin_gridwgeom) +
  tm_fill("logWDA", 
          style = "equal", 
          palette = "Reds",
          title = "Total passenger trips",
          alpha=0.6,
          id="LOC_DESC")
  
plotlogWDA_e

Notably, there was a higher concentration of passengers who boarded the bus at Serangoon Station, Harbourfront/VivoCity, Tiong Bahru Station, Admiralty Station, and Punggol Station during weekday afternoons, according to the equal interval classification method.

Similar to the map for weekday morning peak period, the equal interval seemed to produce more homogeneous classifications.

Show the code
tmap_mode ("view")
plotlogWDA_j <- tm_basemap("CartoDB.Positron") +
  tm_shape(origin_gridwgeom) +
  tm_fill("logWDA", 
          style = "jenks", 
          palette = "Reds",
          title = "Total passenger trips",
          alpha=0.6,
          id="LOC_DESC")
  
plotlogWDA_j

Jenks’ classification delivered insights similar to the quantile classification: higher concentrations of ridership can be observed in the southern downtown areas.

It also highlighted Opp Airline Hse in the far East as a bus stop with high ridership, something not as visible using the equal interval method.

Alternative modes of commuting might be more popular in the West and North-West regions, as illustrated by the lighter shades of red hexagons.

4.2.3 Weekend/Holiday Morning Peak Period

Show the code
tmap_mode ("view")
plotlogWEM_q <- tm_basemap("CartoDB.Positron") +
  tm_shape(origin_gridwgeom) +
  tm_fill("logWEM", 
          style = "quantile", 
          palette = "Blues",
          title = "Total passenger trips",
          alpha=0.6,
          id="LOC_DESC")
  
plotlogWEM_q

Generally, the distribution of bus ridership looks varied across the island.

During the weekend/holiday morning peak period, ridership in the far West region remained relatively low. Interestingly, the bus trips recorded in the Seletar area appeared to have dipped compared to the weekday peak periods. Buses in these industrial areas could be oriented towards work-related travel, and thus see less use on weekends.

Bus stops nearer to housing estates, shopping malls, and Woodlands Checkpoint demonstrated higher levels of weekend morning ridership.

The bus stops with the most boarding passengers were Sunshine Place, Opp BLK 271, BLK 252, Aft. Hasanah Mosque, Buona Vista Station, Harbourfront/Vivocity, Admiralty Station, BLK 555, Bedok Reservoir Station, BLK 22, and BLK 109.

Show the code
tmap_mode ("view")
plotlogWEM_e <- tm_basemap("CartoDB.Positron") +
  tm_shape(origin_gridwgeom) +
  tm_fill("logWEM", 
          style = "equal", 
          palette = "Blues",
          title = "Total passenger trips",
          alpha=0.6,
          id="LOC_DESC")
  
plotlogWEM_e

Equal interval classification highlighted the following bus stops as having the highest ridership during the weekend/holiday morning peak period: Harbourfront/Vivocity, Tiong Bahru Station, Orchard Station/Lucky Plaza, Admiralty Station, Aft. Punggol Road.

Show the code
tmap_mode ("view")
plotlogWEM_j <- tm_basemap("CartoDB.Positron") +
  tm_shape(origin_gridwgeom) +
  tm_fill("logWEM", 
          style = "jenks", 
          palette = "Blues",
          title = "Total passenger trips",
          alpha=0.6,
          id="LOC_DESC")
  
plotlogWEM_j

The findings noted with Jenks’ method are similar to those from the quantile classification.

4.2.4 Weekend/Holiday Evening

Show the code
tmap_mode ("view")
plotlogWEE_q <- tm_basemap("CartoDB.Positron") +
  tm_shape(origin_gridwgeom) +
  tm_fill("logWEE", 
          style = "quantile", 
          palette = "Reds",
          title = "Total passenger trips",
          alpha=0.6,
          id="LOC_DESC")
  
plotlogWEE_q

On weekend/holiday evenings, there seemed to be increased ridership at the bus stops near Changi Airport compared to the other peak periods.

Sunshine Plaza remains one of the most frequented bus stops, exhibiting high ridership across all four peak periods. While the exact reason for this is difficult to pinpoint, it’s possible that the buses stopping here connect to a wide variety of regions, which could explain the high ridership.

Woodlands Checkpoint also demonstrated high levels of ridership across the different peak periods.

Visually, there appear to be more bus stops with high ridership across Singapore during the weekend/holiday evening peak period. For example, there seems to be an increase in passenger volume at the Tanah Merah Ferry Terminal compared to the other peak periods.

Show the code
tmap_mode ("view")
plotlogWEE_e <- tm_basemap("CartoDB.Positron") +
  tm_shape(origin_gridwgeom) +
  tm_fill("logWEE", 
          style = "equal", 
          palette = "Reds",
          title = "Total passenger trips",
          alpha=0.6,
          id="LOC_DESC")
  
plotlogWEE_e

The bus stops with higher traffic seem to be quite consistent across the different peak periods. This includes Woodlands Checkpoint, Kranji Station, Admiralty Station, Serangoon Station, Aft. Punggol Road, Bukit Panjang MRT, Yio Chu Kang Interchange.

Show the code
tmap_mode ("view")
plotlogWEE_j <- tm_basemap("CartoDB.Positron") +
  tm_shape(origin_gridwgeom) +
  tm_fill("logWEE", 
          style = "jenks", 
          palette = "Reds",
          title = "Total passenger trips",
          alpha=0.6,
          id="LOC_DESC")
  
plotlogWEE_j

The findings noted with Jenks’ method are similar to those from the quantile classification.

5 Spatial Association Analysis

According to Tobler’s First Law of Geography, “everything is related to everything else, but near things are more related than distant things.”

This sub-section will cover the computation of local spatial autocorrelation statistics and complete spatial randomness tests for local spatial autocorrelation. The goal of these analyses is to understand whether there are clusters or outliers of bus stops with high or low ridership across Singapore.

5.1 Global Spatial Autocorrelation

5.1.1 Spatial Weights Matrix

To compute the local spatial autocorrelation statistics, we first need to construct a spatial weights matrix for Singapore. Spatial relationships are multi-directional and multi-lateral. Spatial weights define which spatial units are neighbours of one another.

There are two common methods of spatial weights: contiguity-based and distance-based.

Contiguity-based: Neighbours share a common boundary, which can be further distinguished by the Rook and Queen criteria of contiguity. Rook contiguity defines neighbours by the existence of a common edge between two spatial units, while Queen contiguity defines neighbours as spatial units sharing either a common edge or a common vertex.
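To make the two contiguity criteria concrete, here is a minimal base R sketch on a hypothetical 3x3 lattice (illustrative only, not the study data), listing the rook and queen neighbours of the centre cell:

```r
# Cells of a 3x3 lattice numbered 1-9 row by row; cell 5 is the centre
coords <- expand.grid(col = 1:3, row = 1:3)   # cell centres on a unit grid

# Rook: cells differ by 1 in exactly one axis (shared edge);
# Queen: cells differ by at most 1 in both axes (shared edge or vertex)
is_rook  <- function(i, j) sum(abs(coords[i, ] - coords[j, ])) == 1
is_queen <- function(i, j) max(abs(coords[i, ] - coords[j, ])) == 1 && i != j

rook_nb  <- which(sapply(1:9, function(j) is_rook(5, j)))
queen_nb <- which(sapply(1:9, function(j) is_queen(5, j)))

rook_nb   # the four edge-sharing cells: 2, 4, 6, 8
queen_nb  # all eight surrounding cells: 1, 2, 3, 4, 6, 7, 8, 9
```

In practice st_contiguity() of sfdep derives these neighbour lists directly from the polygon geometries, so the hand-rolled logic above is purely illustrative.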

Distance-based: Higher weights are assigned to pairs of locations that are closer to each other, and lower weights to pairs that are further apart. This can be further distinguished into fixed, adaptive and inverse-distance weighting schemes. The fixed weighting scheme considers two regions to be neighbours if they are within a specified distance of one another. Under the adaptive weighting scheme, each region has the same number of neighbours, specified beforehand as k. The inverse distance method assumes that the closer two features are in space, the more likely they are to interact with or influence each other.
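A minimal base R sketch of the three schemes, using five hypothetical points on a line (illustrative values only):

```r
pts <- c(0, 1, 2, 5, 9)                 # hypothetical 1-D point locations
d   <- as.matrix(dist(pts))             # pairwise distance matrix

# Fixed: neighbours are points within a band of 2 units
fixed_nb <- lapply(1:5, function(i) which(d[i, ] > 0 & d[i, ] <= 2))

# Adaptive: every point gets exactly its k = 2 nearest neighbours
adaptive_nb <- lapply(1:5, function(i) order(d[i, ])[2:3])

# Inverse distance: weights within the band decay as 1/distance
idw_wt <- lapply(1:5, function(i) 1 / d[i, fixed_nb[[i]]])

lengths(fixed_nb)       # points 4 and 5 get no neighbours under the fixed band
lengths(adaptive_nb)    # every point gets exactly 2 under the adaptive scheme
```

Note how the two isolated points end up with no neighbours under the fixed band, while the adaptive scheme guarantees neighbours everywhere. This is precisely the trade-off that motivates the choice of scheme below.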

For this study, we will use a distance-based weight matrix, as there are areas where bus stops are sparse (such as Lim Chu Kang and Mandai) or isolated (for example, Tanah Merah Ferry Terminal, Changi Naval Base, Resorts World Sentosa, Marina Bay Cruise Centre). Consequently, a contiguity-based matrix may yield many regions with no neighbours, making it unsuitable for our analysis.

Step 1: Determine Cut-Off Distance Limit

The first step is to determine the upper distance limit that ensures each hexagon has at least one neighbour.

The following functions can be used:

  • st_knn() of sfdep is used to identify neighbours based on k (e.g. k = 8 indicates the eight nearest neighbours). The output is a neighbours list of class nb. If polygon geometry is provided, the centroids of the polygons are used for the calculation.
  • st_nb_dists() of sfdep is used to calculate the nearest neighbour distance. The output is a list of distances for each observation’s neighbors list.
  • unlist() of Base R is then used to return the output as a vector so that the summary statistics of the nearest neighbour distances can be derived.
geo <- sf::st_geometry(origin_gridwgeom)
nb <- st_knn(geo, 
             longlat = TRUE)
dists <- unlist(st_nb_dists(geo, nb))

Step 2: Derive Summary Stats

We can derive summary statistics of the nearest neighbour distances vector (i.e. dists) by using the code chunk below.

summary(dists)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    0.00   22.89    0.00 1000.00 

The maximum nearest neighbour distance is 1000m, so we will use a threshold value of 1001m to ensure each spatial unit has at least one neighbour.
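The logic of Steps 1 and 2 can be sketched in base R on hypothetical coordinates: the largest nearest-neighbour distance is the smallest band width that leaves no unit neighbourless.

```r
pts <- cbind(x = c(0, 0, 3, 10), y = c(0, 4, 0, 0))  # hypothetical coordinates
d   <- as.matrix(dist(pts))
diag(d) <- Inf                        # exclude self-distances
nn_dist   <- apply(d, 1, min)         # each point's nearest-neighbour distance
threshold <- max(nn_dist)             # any band >= this gives every point a neighbour
threshold                             # 7: point 4 is 7 units from its closest point
```

This mirrors what st_knn() plus st_nb_dists() do for the hexagons above, where the same maximum came out to 1000m.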

Step 3: Compute fixed distance weight

Now we will compute the fixed distance weights using the following functions:

  • st_dist_band() of sfdep is used to identify neighbours based on a distance band (i.e. 1001m). The output is a list of neighbours (i.e. nb).
  • st_weights() is then used to calculate polygon spatial weights for the nb list. Note that the default style argument is set to “W” for row-standardised weights, and the default allow_zero is set to TRUE, which assigns zero as the lagged value to zones without neighbours.

wm_fd <- origin_gridwgeom %>%
  mutate(nb = st_dist_band(geo,
                           upper = 1001),
               wt = st_weights(nb),
               .before = 1)

Step 4: Observations

Show the code
#kable(head(wm_fd,5))
summary(wm_fd$nb)
Neighbour list object:
Number of regions: 5022 
Number of nonzero links: 266698 
Percentage nonzero weights: 1.057466 
Average number of links: 53.10593 
Link number distribution:

  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
  3   6   3   4   7   6  20  19  20  13  24  13  21  29  20  35  28  40  34  40 
 21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
 24  52  45  33  39  48  26  44  31  30  31  52  48  39  35  55  48  49  83  55 
 41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
 74  67  59  73  76  83  96  81  59  53 101 136  87 117 101  84 111  79 134 127 
 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
114 150  96 113 103 103 106  81  94 107  61  73  92  55  41  64  61  73  42  31 
 81  82  83  84  85  86  87  88  89  90  92  93  94  95  96  97  98  99 100 102 
 40  32   9  30  30  26   9  19  14  21   6   9  12  15   8   4   4   6  13   5 
3 least connected regions:
2274 5021 5022 with 1 link
5 most connected regions:
3699 3700 3701 3702 3703 with 102 links

From the result above, we can confirm that all hexagons have at least one neighbour, and that there are 5 hexagons with 102 neighbours each. On average, each hexagon has 53 neighbours under the distance-based weight matrix.

A characteristic of the fixed distance weights matrix is that more densely settled areas (towns, residential neighbourhoods) tend to have more neighbours, while less densely settled areas (military camps, industrial estates) tend to have fewer. To overcome this uneven distribution of neighbours, we can directly control the number of neighbours using k-nearest neighbours by setting the value of k in the code chunk below.

As a rule of thumb, we will set k = 8, i.e. all hexagons will have 8 neighbours.

wm_ad <- origin_gridwgeom %>% 
  mutate(nb = st_knn(geo,
                     k=8),
         wt = st_weights(nb),
               .before = 1)

head(wm_ad, n=3)
Simple feature collection with 3 features and 16 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 3720.122 ymin: 27925.48 xmax: 4720.122 ymax: 31100.9
Projected CRS: SVY21 / Singapore TM
                           nb
1  2, 4, 5, 9, 10, 15, 32, 33
2  1, 4, 5, 9, 10, 15, 32, 33
3 5, 6, 7, 11, 12, 16, 17, 18
                                                      wt grid_id busstops.x
1 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125      34          1
2 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125      65          1
3 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125      99          1
  ORIGIN_BS WDA WDM WEE WEM   logWDM   logWDA   logWEM   logWEE BUS_ROOF_N
1     25059 417  62  65   5 4.127134 6.033086 1.609438 4.174387        UNK
2     25751 110  50  26  24 3.912023 4.700480 3.178054 3.258097       B02D
3     26379 249  44  54  27 3.784190 5.517453 3.295837 3.988984        NIL
             LOC_DESC busstops.y              area_hexagon_grid
1   AFT TUAS STH BLVD          1 POLYGON ((3970.122 27925.48...
2 BEF TUAS STH AVE 14          1 POLYGON ((4220.122 28358.49...
3            YONG NAM          1 POLYGON ((4470.122 30523.55...

The results show that each of the 8 neighbours has been assigned a weight of 1/8 (0.125) of the total.
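Row standardisation (the “W” style) is simple to sketch in base R: each region's binary neighbour weights are divided by that region's row total. The two hypothetical neighbour counts below are illustrative only.

```r
nb <- list(a = 1:8, b = 1:4)    # two hypothetical regions with 8 and 4 neighbours
wt <- lapply(nb, function(n) rep(1, length(n)) / length(n))
wt$a    # eight weights of 1/8 = 0.125, matching the output above
wt$b    # four weights of 1/4 = 0.25
```

Each row of weights sums to 1, so a region's spatially lagged value becomes the simple average of its neighbours' values.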

Inverse distance weights take into account the decay of influence with distance.

We can derive spatial weight matrix based on inverse distance method using the following functions:

  • st_contiguity() of sfdep is used to identify the neighbours by using contiguity criteria. The output is a list of neighbours (i.e. nb).

  • st_inverse_distance() is then used to calculate inverse distance weights of neighbours on the nb list.

wm_idw <- origin_gridwgeom %>%
  mutate(nb = st_contiguity(geo),
         wts = st_inverse_distance(nb, geo,
                                   scale = 1,
                                   alpha = 1),
         .before = 1)

summary(wm_idw$nb)
Neighbour list object:
Number of regions: 5022 
Number of nonzero links: 107808 
Percentage nonzero weights: 0.4274621 
Average number of links: 21.46714 
6 regions with no links:
1750 2274 3159 4675 4989 4994
Link number distribution:

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
  6  12  17  33  23  97  70  99  88  76 115 136 126 133 130 160 168 227 197 169 
 20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39 
159 174 251 240 213 195 222 210 183 187 124 115 104 110 111  39  54  56  11  17 
 40  41  42  43  44  48 
 55  62  13  17  10   8 
12 least connected regions:
1 32 62 736 3354 3355 4982 4990 5002 5003 5021 5022 with 1 link
8 most connected regions:
3048 3049 3050 3051 3052 3053 3054 3055 with 48 links

Combining contiguity-based neighbours with inverse distance weights resulted in 6 regions with no neighbours; this could be due to the spatial isolation of certain hexagons.

In summary:

  • The number of neighbours under the fixed distance method varies widely, from 1 to 102. This uneven distribution could affect the spatial autocorrelation analysis.
  • The inverse distance method led to regions with no neighbours, and is computationally more intensive since a distance-decayed weight must be calculated for each neighbour pair.
  • Since each hexagon is equally sized, the adaptive distance-based spatial weight matrix is best suited for our analysis, as each centroid represents its region equally well.
centroid <- st_centroid(origin_gridwgeom)
plot(origin_gridwgeom$area_hexagon_grid, border = "lightgrey")

#plot(wm_idw, centroid, pch = 19, cex = 0.1, add = TRUE, col = "red")

5.2 Computing Global Spatial Autocorrelation Statistics

5.2.1 Moran’s I

We will perform Moran’s I statistical testing by using global_moran_perm() of sfdep. The Global Moran’s I permutation test is a statistical method used in spatial analysis to assess the significance of spatial autocorrelation in a dataset. Spatial autocorrelation refers to the degree to which a variable is correlated with itself across space, indicating patterns such as clustering or dispersion.

Moran’s I statistic ranges from -1 to 1. If Moran’s I is:

  • positive (I>0): Clustered, observations tend to be similar
  • negative (I<0): Dispersed, observations tend to be dissimilar
  • approximately zero: observations arranged randomly over space
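Before running global_moran_perm(), the statistic and its permutation test can be computed from scratch in base R on a toy variable (hypothetical values and a hand-built line-adjacency weight matrix, purely for illustration):

```r
set.seed(1234)
x <- c(10, 12, 11, 2, 1, 3)             # toy variable: two clusters of values
W <- matrix(0, 6, 6)
for (i in 1:6) for (j in 1:6)
  if (abs(i - j) == 1) W[i, j] <- 1     # neighbours: adjacent positions on a line
W <- W / rowSums(W)                     # row-standardise ("W" style)

# Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z = x - mean(x)
moran_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

obs   <- moran_i(x, W)                  # positive: similar values sit adjacent
perm  <- replicate(99, moran_i(sample(x), W))   # reshuffle x 99 times
p_val <- (sum(perm >= obs) + 1) / (99 + 1)      # one-sided pseudo p-value
c(obs = obs, p = p_val)
```

global_moran_perm() wraps this same Monte-Carlo idea (via spdep), using a two-sided alternative by default.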

The code chunk below performs Moran’s I statistical testing, using the null and alternative hypotheses as follows:

H0: The observed spatial patterns of bus ridership in Singapore are not clustered (i.e. either random or dispersed).

H1: The observed spatial patterns of bus ridership in Singapore are clustered.

A total of 100 simulations will be performed with seed number 1234. The set.seed() function allows us to create reproducible results.

Note: the number of simulations performed by global_moran_perm() is nsim + 1, i.e. for nsim = 99, 100 simulations will be performed.

set.seed(1234)
Show the code
gmp_WDM <- global_moran_perm(wm_ad$WDM,
                  wm_ad$nb,
                  wm_ad$wt,
                  nsim = 99)
gmp_WDM

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.094609, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
Show the code
gmp_WDA <- global_moran_perm(wm_ad$WDA,
                  wm_ad$nb,
                  wm_ad$wt,
                  nsim = 99)
gmp_WDA

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.063584, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
Show the code
gmp_WEM <- global_moran_perm(wm_ad$WEM,
                  wm_ad$nb,
                  wm_ad$wt,
                  nsim = 99)
gmp_WEM

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.095565, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
Show the code
gmp_WEE <- global_moran_perm(wm_ad$WEE,
                  wm_ad$nb,
                  wm_ad$wt,
                  nsim = 99)
gmp_WEE

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.083812, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided

Across the 4 peak periods, the permutation test generated p-values below 0.05. This indicates that we can reject the null hypothesis at the 95% confidence level and conclude that, for each of the 4 peak periods, overall bus ridership across Singapore is spatially clustered (since a positive Moran’s I value is obtained).

5.2.2 Geary’s C

5.2.3 Spatial Correlogram

5.3 Local Spatial Autocorrelation Statistics

Global spatial autocorrelation provides a broad overview of spatial clustering within a dataset, offering a single value that indicates whether similar values are generally clustered or dispersed across the entire study area. In contrast, local spatial autocorrelation delves into specific locations, identifying where clusters of similar values (hot spots or cold spots) or spatial outliers exist. While global metrics give an overall trend, local metrics provide detailed, location-specific insights, highlighting exact areas of significant spatial clustering or anomaly.

Thus, after we have established through statistical testing that spatial clustering of bus ridership occurs in Singapore, we now seek to detect clusters or outliers and discover if there are any hot or cold spots of high ridership using Local Spatial Autocorrelation Statistics.

5.3.1 Local Moran’s I

In this section, we will perform local Moran’s I statistical testing by using local_moran() of sfdep. The output of local_moran() is a data.frame containing the columns below:

  • ii: local moran statistic
  • eii: expectation of local Moran’s I statistic
  • var_ii: variance of local Moran’s I statistic
  • z_ii: standard deviate of local Moran’s I statistic
  • p_ii: p-value of local Moran’s I statistic using pnorm()
  • p_ii_sim: For localmoran_perm(), rank() and punif() of observed statistic rank for [0, 1] p-values using alternative=
  • p_folded_sim: the simulation folded [0, 0.5] range ranked p-value, based on crand.py of pysal
  • skewness: For localmoran_perm, the output of e1071::skewness() for the permutation samples underlying the standard deviates
  • kurtosis: For localmoran_perm, the output of e1071::kurtosis() for the permutation samples underlying the standard deviates.

unnest() of tidyr package helps expand a list-column containing data frames into rows and columns.

Show the code
lisa_WDM <- wm_ad %>% 
  mutate(local_moran = local_moran(
    origin_gridwgeom$WDM, nb, wt, nsim = 99),
         .before = 1) %>%
  unnest(local_moran)

head(lisa_WDM, n=5)
Simple feature collection with 5 features and 28 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 3720.122 ymin: 27925.48 xmax: 4970.122 ymax: 31100.9
Projected CRS: SVY21 / Singapore TM
# A tibble: 5 × 29
     ii      eii var_ii  z_ii  p_ii p_ii_sim p_folded_sim skewness kurtosis
  <dbl>    <dbl>  <dbl> <dbl> <dbl>    <dbl>        <dbl>    <dbl>    <dbl>
1 0.190  0.00735 0.0204 1.28  0.201     0.02         0.01    -2.55     8.37
2 0.190 -0.0100  0.0433 0.962 0.336     0.02         0.01    -5.72    42.9 
3 0.186  0.00312 0.0459 0.853 0.394     0.02         0.01    -6.21    49.1 
4 0.181  0.0166  0.0142 1.38  0.167     0.02         0.01    -2.06     5.79
5 0.167  0.00919 0.0118 1.45  0.147     0.02         0.01    -2.18     7.71
# ℹ 20 more variables: mean <fct>, median <fct>, pysal <fct>, nb <nb>,
#   wt <list>, grid_id <int>, busstops.x <int>, ORIGIN_BS <chr>, WDA <dbl>,
#   WDM <dbl>, WEE <dbl>, WEM <dbl>, logWDM <dbl>, logWDA <dbl>, logWEM <dbl>,
#   logWEE <dbl>, BUS_ROOF_N <chr>, LOC_DESC <chr>, busstops.y <int>,
#   area_hexagon_grid <POLYGON [m]>
lisa_WDA <- wm_ad %>% 
  mutate(local_moran = local_moran(
    origin_gridwgeom$WDA, nb, wt, nsim = 99),
         .before = 1) %>%
  unnest(local_moran)

head(lisa_WDA, n=5)
Simple feature collection with 5 features and 28 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 3720.122 ymin: 27925.48 xmax: 4970.122 ymax: 31100.9
Projected CRS: SVY21 / Singapore TM
# A tibble: 5 × 29
      ii     eii   var_ii  z_ii  p_ii p_ii_sim p_folded_sim skewness kurtosis
   <dbl>   <dbl>    <dbl> <dbl> <dbl>    <dbl>        <dbl>    <dbl>    <dbl>
1 0.0752 0.00345 0.00381  1.16  0.245     0.02         0.01    -1.92     4.44
2 0.0795 0.00807 0.00673  0.870 0.384     0.04         0.02    -3.02    10.4 
3 0.0734 0.00141 0.00811  0.799 0.424     0.08         0.04    -2.87     9.31
4 0.0567 0.0145  0.00114  1.25  0.212     0.02         0.01    -3.00    12.6 
5 0.0415 0.00794 0.000767 1.21  0.225     0.02         0.01    -2.80    11.0 
# ℹ 20 more variables: mean <fct>, median <fct>, pysal <fct>, nb <nb>,
#   wt <list>, grid_id <int>, busstops.x <int>, ORIGIN_BS <chr>, WDA <dbl>,
#   WDM <dbl>, WEE <dbl>, WEM <dbl>, logWDM <dbl>, logWDA <dbl>, logWEM <dbl>,
#   logWEE <dbl>, BUS_ROOF_N <chr>, LOC_DESC <chr>, busstops.y <int>,
#   area_hexagon_grid <POLYGON [m]>
lisa_WEM <- wm_ad %>% 
  mutate(local_moran = local_moran(
    origin_gridwgeom$WEM, nb, wt, nsim = 99),
         .before = 1) %>%
  unnest(local_moran)

head(lisa_WEM, n=5)
Simple feature collection with 5 features and 28 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 3720.122 ymin: 27925.48 xmax: 4970.122 ymax: 31100.9
Projected CRS: SVY21 / Singapore TM
# A tibble: 5 × 29
     ii       eii  var_ii  z_ii  p_ii p_ii_sim p_folded_sim skewness kurtosis
  <dbl>     <dbl>   <dbl> <dbl> <dbl>    <dbl>        <dbl>    <dbl>    <dbl>
1 0.143  0.000449 0.0154  1.15  0.252     0.04         0.02   -2.77     8.66 
2 0.141  0.0126   0.00812 1.43  0.154     0.06         0.03   -0.884    0.201
3 0.130 -0.0145   0.0446  0.684 0.494     0.02         0.01   -3.95    18.0  
4 0.124  0.00189  0.0111  1.16  0.247     0.04         0.02   -2.96    11.3  
5 0.115 -0.00534  0.0161  0.948 0.343     0.02         0.01   -5.03    33.2  
# ℹ 20 more variables: mean <fct>, median <fct>, pysal <fct>, nb <nb>,
#   wt <list>, grid_id <int>, busstops.x <int>, ORIGIN_BS <chr>, WDA <dbl>,
#   WDM <dbl>, WEE <dbl>, WEM <dbl>, logWDM <dbl>, logWDA <dbl>, logWEM <dbl>,
#   logWEE <dbl>, BUS_ROOF_N <chr>, LOC_DESC <chr>, busstops.y <int>,
#   area_hexagon_grid <POLYGON [m]>
lisa_WEE <- wm_ad %>% 
  mutate(local_moran = local_moran(
    origin_gridwgeom$WEE, nb, wt, nsim = 99),
         .before = 1) %>%
  unnest(local_moran)

head(lisa_WEE, n=5)
Simple feature collection with 5 features and 28 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 3720.122 ymin: 27925.48 xmax: 4970.122 ymax: 31100.9
Projected CRS: SVY21 / Singapore TM
# A tibble: 5 × 29
      ii       eii  var_ii  z_ii  p_ii p_ii_sim p_folded_sim skewness kurtosis
   <dbl>     <dbl>   <dbl> <dbl> <dbl>    <dbl>        <dbl>    <dbl>    <dbl>
1 0.0904 -0.0160   0.0277  0.639 0.523     0.04         0.02    -4.96    27.3 
2 0.0924 -0.00619  0.0143  0.824 0.410     0.02         0.01    -2.83     8.51
3 0.0809  0.00535  0.00947 0.777 0.437     0.14         0.07    -2.66     7.65
4 0.0756  0.000552 0.00974 0.760 0.447     0.02         0.01    -4.97    30.6 
5 0.0637  0.000789 0.00272 1.20  0.228     0.04         0.02    -2.01     4.58
# ℹ 20 more variables: mean <fct>, median <fct>, pysal <fct>, nb <nb>,
#   wt <list>, grid_id <int>, busstops.x <int>, ORIGIN_BS <chr>, WDA <dbl>,
#   WDM <dbl>, WEE <dbl>, WEM <dbl>, logWDM <dbl>, logWDA <dbl>, logWEM <dbl>,
#   logWEE <dbl>, BUS_ROOF_N <chr>, LOC_DESC <chr>, busstops.y <int>,
#   area_hexagon_grid <POLYGON [m]>

6 Limitations

Bus ridership patterns should be analysed in conjunction with other modes of public transportation, such as the MRT, to form a more complete picture of urban mobility.